  1. The customer is a used-car sales service.
  2. The end product is an app that tells users the market value of their car.
  3. We need to build a model that estimates that value.
  4. Specifications, equipment and prices of other cars are available.

Criteria that are important for the customer:

  • quality of prediction;
  • model training time;
  • model prediction time.

RMSE is recommended as the quality metric, with a target of RMSE < 2500.
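As a quick illustration of the metric, RMSE can be computed with scikit-learn's `mean_squared_error` and a square root (a minimal sketch; the prices below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative true vs. predicted prices in euros (made-up numbers)
y_true = np.array([4800, 1200, 15300, 700])
y_pred = np.array([5000, 1500, 14800, 1000])

# RMSE = sqrt(MSE); the customer's target is RMSE < 2500
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(rmse, 2))
```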

!pip install distinctipy colorcet pgeocode geopy xgboost
!pip install seaborn --upgrade
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import lightgbm as lgb
import sklearn
import pgeocode
import catboost
import xgboost

from geopy.geocoders import Nominatim
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.ensemble import RandomForestRegressor 
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from catboost import CatBoostRegressor
sb.__version__
'0.13.2'

Seaborn 0.12.0+ is needed.

Primary data analysis

try:
    data = pd.read_csv('autos.csv')
except FileNotFoundError:
    data = pd.read_csv('/datasets/autos.csv')
data
DateCrawled Price VehicleType RegistrationYear Gearbox Power Model Kilometer RegistrationMonth FuelType Brand Repaired DateCreated NumberOfPictures PostalCode LastSeen
0 2016-03-24 11:52:17 480 NaN 1993 manual 0 golf 150000 0 petrol volkswagen NaN 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57
1 2016-03-24 10:58:45 18300 coupe 2011 manual 190 NaN 125000 5 gasoline audi yes 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50
2 2016-03-14 12:52:21 9800 suv 2004 auto 163 grand 125000 8 gasoline jeep NaN 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46
3 2016-03-17 16:54:04 1500 small 2001 manual 75 golf 150000 6 petrol volkswagen no 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17
4 2016-03-31 17:25:20 3600 small 2008 manual 69 fabia 90000 7 gasoline skoda no 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354364 2016-03-21 09:50:58 0 NaN 2005 manual 0 colt 150000 7 petrol mitsubishi yes 2016-03-21 00:00:00 0 2694 2016-03-21 10:42:49
354365 2016-03-14 17:48:27 2200 NaN 2005 NaN 0 NaN 20000 1 NaN sonstige_autos NaN 2016-03-14 00:00:00 0 39576 2016-04-06 00:46:52
354366 2016-03-05 19:56:21 1199 convertible 2000 auto 101 fortwo 125000 3 petrol smart no 2016-03-05 00:00:00 0 26135 2016-03-11 18:17:12
354367 2016-03-19 18:57:12 9200 bus 1996 manual 102 transporter 150000 3 gasoline volkswagen no 2016-03-19 00:00:00 0 87439 2016-04-07 07:15:26
354368 2016-03-20 19:41:08 3400 wagon 2002 manual 100 golf 150000 6 gasoline volkswagen NaN 2016-03-20 00:00:00 0 40764 2016-03-24 12:45:21

354369 rows × 16 columns

Features:

  • DateCrawled — date the listing was downloaded from the database
  • VehicleType — vehicle body type
  • RegistrationYear — vehicle registration year
  • Gearbox — gearbox type
  • Power — power (hp)
  • Model — car model
  • Kilometer — mileage (km)
  • RegistrationMonth — vehicle registration month
  • FuelType — fuel type
  • Brand — car brand
  • Repaired — whether or not the car has been repaired
  • DateCreated — listing creation date
  • NumberOfPictures — number of photos of the vehicle
  • PostalCode — postal code of the listing owner (user)
  • LastSeen — date of last user activity

Target:

  • Price — price (euro)

The plan for the analysis is as follows:

  1. Construct bar charts for the qualitative variables
  2. Construct histograms for quantitative variables
  3. Construct box plots for quantitative variables
  4. View price trends by day
  5. Estimate the number of missing values
  6. Count duplicates
  7. Assess the impact of outliers

Qualitative variables

data.dtypes
DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Kilometer             int64
RegistrationMonth     int64
FuelType             object
Brand                object
Repaired             object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

Let's see what values occur in the columns:

  • VehicleType
  • Gearbox
  • Model
  • RegistrationMonth
  • FuelType
  • Brand
  • Repaired
  • PostalCode
def barplot_from_counts(series, by_index=False):
    from matplotlib import patches as mpatches
    from matplotlib.ticker import MaxNLocator

    if len(series) <= 9:
        colors = plt.colormaps.get_cmap('Set1').colors
    else:
        import colorcet
        colors = colorcet.glasbey[:len(series)] 
        # import distinctipy
        # colors = distinctipy.get_colors(len(series), pastel_factor=0.57)
        
    if by_index:
        series = series.sort_index()
    
    series.index = series.index.map(str)
    legend = []    
    for row in range(len(series)):
        legend.append(mpatches.Patch(color=colors[row], 
            label='{}: {}'.format(series.index[row], series.iloc[row])))
    fig, ax = plt.subplots(figsize=(12, 9))
    plt.bar(x=series.index, 
            height=series.values, 
            color=colors)
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    plt.title(series.name, fontsize=14)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.legend(handles=legend, fontsize=14)
    plt.show()

Vehicle body type

data['VehicleType'].isna().sum()
37490
data['VehicleType'].isna().sum() / len(data)
0.1057936783409384

About 10.6% of cars have an unknown body type.

Let's look at the rest of them.

barplot_from_counts(data['VehicleType'].value_counts())

There is little difference between other and NaN. Let's replace missing values with "other" at the preprocessing stage.

Gearbox

data['Gearbox'].isna().sum()
19833
barplot_from_counts(data['Gearbox'].value_counts())

This is a significant imbalance. We can try to account for it when preparing the samples, if training on the whole dataset takes too long.

Car model

data['Model'].isna().sum()
19705
data['Model'].value_counts()
golf                  29232
other                 24421
3er                   19761
polo                  13066
corsa                 12570
                      ...  
serie_2                   8
rangerover                4
serie_3                   4
range_rover_evoque        2
serie_1                   2
Name: Model, Length: 250, dtype: int64

There are 250 models in total. Here, too, the missing values will go to the "other" category. I wonder whether each model maps unambiguously to a single brand.

data.loc[data['Model'] == 'golf']['Brand'].unique()
array(['volkswagen'], dtype=object)

All cars of the golf model belong to the Volkswagen brand. At the EDA stage we will have to check this for the others as well.
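That check could be done for all models at once with a groupby. A sketch on a toy frame (the real call would be on `data` with the same column names):

```python
import pandas as pd

# Toy stand-in for the real dataset; same column names as `data`
toy = pd.DataFrame({
    'Model': ['golf', 'golf', 'other', 'other', '3er'],
    'Brand': ['volkswagen', 'volkswagen', 'audi', 'bmw', 'bmw'],
})

# Number of distinct brands per model; anything above 1 is ambiguous
brands_per_model = toy.groupby('Model')['Brand'].nunique()
ambiguous = brands_per_model[brands_per_model > 1]
print(ambiguous.to_dict())
```

On the toy frame only the catch-all "other" model spans several brands, which is exactly what we would expect on the real data as well.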

Month of registration

barplot_from_counts(data['RegistrationMonth'].value_counts(), by_index=True)

Maybe there is some seasonal logic here, but what matters is how the month affects the price. What is really puzzling is that there are 13 distinct months: apparently month 0 stands for "unspecified".

Fuel type

data['FuelType'].isna().sum()
32895
data['FuelType'].value_counts()
petrol      216352
gasoline     98720
lpg           5310
cng            565
hybrid         233
other          204
electric        90
Name: FuelType, dtype: int64

Here it is less obvious that missing values should go to the "other" category. Nevertheless, let's do it; we will then need to check how often "other" values coincide across rows.

Car brand

data['Brand'].isna().sum()
0
barplot_from_counts(data['Brand'].value_counts())

Volkswagen is clearly in the lead. Das auto.

Breakdown history

data['Repaired'].isna().sum()
71154
barplot_from_counts(data['Repaired'].value_counts())

I wonder who verifies this: if the owners report it themselves, these numbers may not reflect reality.

Postal code

Although there are many distinct values here, this is still a categorical variable. A bar chart is infeasible, so let's estimate the distribution without a graph.

data['PostalCode'].value_counts()
10115    819
65428    613
66333    343
32257    317
44145    317
        ... 
21782      1
9517       1
29367      1
38325      1
82404      1
Name: PostalCode, Length: 8143, dtype: int64

The distribution is clearly uneven. On the other hand, two postal code areas may simply differ in population and car ownership.

Maybe we can group it by area?

%%time
list_of_states = []
for postal_code in ['10115']:
    for country in pgeocode.COUNTRIES_VALID:
        nomi = pgeocode.Nominatim(country)
        if not pd.isna(nomi.query_postal_code(postal_code)['state_name']):
            list_of_states.append(nomi.query_postal_code(postal_code)['state_name'])
list_of_states
CPU times: user 6.39 s, sys: 896 ms, total: 7.29 s
Wall time: 7.54 s
['Berlin', 'Harju maakond', '경기도', 'Western Province', 'New York']

Apparently not: the same postal code is valid in several countries, so it is unclear which country to assume.
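If we are willing to assume the listings are German (the brands and 5-digit codes make that plausible), there is a cheap alternative that needs no pgeocode lookup: in Germany the first digit of the postal code denotes a broad routing zone (Leitzone). A sketch on a few codes from the data:

```python
import pandas as pd

codes = pd.Series([10115, 65428, 2694, 90480], name='PostalCode')

# The int column drops leading zeros (2694 is really 02694),
# so pad back to 5 digits before taking the first one
zone = codes.astype(str).str.zfill(5).str[0]
print(zone.tolist())
```

This collapses 8143 codes into 10 coarse regions, which is at least usable as a categorical feature.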

Quantitative variables

data.dtypes
DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Kilometer             int64
RegistrationMonth     int64
FuelType             object
Brand                object
Repaired             object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

Let's look at the columns

  • Price
  • RegistrationYear
  • Power
  • Kilometer
  • NumberOfPictures
def hist_from_series(series):
    # sb.set(style="ticks")
    fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, figsize=(12, 9), 
                                      gridspec_kw={"height_ratios": (.20, .8)})
    box = sb.boxplot(x=series, ax=ax_box)
    hist = sb.histplot(x=series, ax=ax_hist)
    
    box.set_xlabel(None)
    box.tick_params(left=False, labelsize=15)
    sb.despine(ax=ax_box, left=False, right=False, top=False, bottom=True)
    
    hist.set_xlabel(series.name, fontsize=15)
    hist.set_ylabel('Count', fontsize=15)
    hist.tick_params(labelsize=15)
    sb.despine(ax=ax_hist, right=False)
    
    plt.show()

Price

series = data['Price']
hist_from_series(series)
series.value_counts()
0        10772
500       5670
1500      5394
1000      4649
1200      4594
         ...  
13180        1
10879        1
2683         1
634          1
8188         1
Name: Price, Length: 3731, dtype: int64

Those zeros will have to be replaced with something, e.g. the median price for the car's year of manufacture. On the other hand, ~11,000 rows is about 3% of the data, so we could simply drop them.

The year of registration

series = data['RegistrationYear']
series.value_counts()
2000    24490
1999    22728
2005    22109
2001    20124
2006    19900
        ...  
4100        1
1200        1
5300        1
8888        1
2290        1
Name: RegistrationYear, Length: 151, dtype: int64

There are clearly anomalies here.

hist_from_series(series.loc[(series.quantile(0.001) < series) & 
                            (series < series.quantile(0.999))])
np.sort(series.unique())
array([1000, 1001, 1039, 1111, 1200, 1234, 1253, 1255, 1300, 1400, 1500,
       1600, 1602, 1688, 1800, 1910, 1915, 1919, 1920, 1923, 1925, 1927,
       1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938,
       1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950,
       1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961,
       1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972,
       1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
       1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
       1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
       2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
       2017, 2018, 2019, 2066, 2200, 2222, 2290, 2500, 2800, 2900, 3000,
       3200, 3500, 3700, 3800, 4000, 4100, 4500, 4800, 5000, 5300, 5555,
       5600, 5900, 5911, 6000, 6500, 7000, 7100, 7500, 7800, 8000, 8200,
       8455, 8500, 8888, 9000, 9229, 9450, 9996, 9999])

The upper cutoff is obvious enough, but where to draw the lower boundary within the 20th century is not clear at all.

Engine power

series = data['Power']
series.value_counts()
0        40225
75       24023
60       15897
150      14590
101      13298
         ...  
323          1
3454         1
1056         1
13636        1
1158         1
Name: Power, Length: 712, dtype: int64
np.sort(series.unique())
array([    0,     1,     2,     3,     4,     5,     6,     7,     8,
           9,    10,    11,    12,    13,    14,    15,    16,    17,
          18,    19,    20,    21,    22,    23,    24,    25,    26,
          27,    28,    29,    30,    31,    32,    33,    34,    35,
          36,    37,    38,    39,    40,    41,    42,    43,    44,
          45,    46,    47,    48,    49,    50,    51,    52,    53,
          54,    55,    56,    57,    58,    59,    60,    61,    62,
          63,    64,    65,    66,    67,    68,    69,    70,    71,
          72,    73,    74,    75,    76,    77,    78,    79,    80,
          81,    82,    83,    84,    85,    86,    87,    88,    89,
          90,    91,    92,    93,    94,    95,    96,    97,    98,
          99,   100,   101,   102,   103,   104,   105,   106,   107,
         108,   109,   110,   111,   112,   113,   114,   115,   116,
         117,   118,   119,   120,   121,   122,   123,   124,   125,
         126,   127,   128,   129,   130,   131,   132,   133,   134,
         135,   136,   137,   138,   139,   140,   141,   142,   143,
         144,   145,   146,   147,   148,   149,   150,   151,   152,
         153,   154,   155,   156,   157,   158,   159,   160,   161,
         162,   163,   164,   165,   166,   167,   168,   169,   170,
         171,   172,   173,   174,   175,   176,   177,   178,   179,
         180,   181,   182,   183,   184,   185,   186,   187,   188,
         189,   190,   191,   192,   193,   194,   195,   196,   197,
         198,   199,   200,   201,   202,   203,   204,   205,   206,
         207,   208,   209,   210,   211,   212,   213,   214,   215,
         216,   217,   218,   219,   220,   221,   222,   223,   224,
         225,   226,   227,   228,   229,   230,   231,   232,   233,
         234,   235,   236,   237,   238,   239,   240,   241,   242,
         243,   244,   245,   246,   247,   248,   249,   250,   251,
         252,   253,   254,   255,   256,   257,   258,   259,   260,
         261,   262,   264,   265,   266,   267,   268,   269,   270,
         271,   272,   273,   274,   275,   276,   277,   278,   279,
         280,   281,   282,   283,   284,   285,   286,   287,   288,
         289,   290,   292,   293,   294,   295,   296,   297,   298,
         299,   300,   301,   303,   304,   305,   306,   307,   308,
         309,   310,   311,   313,   314,   315,   316,   317,   318,
         320,   321,   322,   323,   324,   325,   326,   327,   328,
         329,   330,   331,   332,   333,   334,   335,   336,   337,
         338,   339,   340,   341,   343,   344,   345,   346,   347,
         348,   349,   350,   351,   352,   353,   354,   355,   356,
         357,   358,   360,   361,   362,   363,   364,   365,   367,
         368,   370,   371,   374,   375,   376,   377,   379,   380,
         381,   382,   385,   386,   387,   388,   390,   392,   394,
         396,   398,   399,   400,   401,   402,   405,   408,   409,
         411,   416,   420,   421,   425,   426,   428,   430,   431,
         435,   440,   442,   445,   449,   450,   454,   457,   459,
         460,   475,   476,   485,   487,   489,   490,   500,   504,
         505,   507,   508,   510,   514,   515,   517,   519,   520,
         521,   525,   530,   540,   541,   544,   550,   551,   553,
         560,   572,   574,   579,   580,   584,   585,   599,   600,
         601,   602,   603,   604,   606,   607,   610,   612,   620,
         640,   645,   650,   651,   671,   678,   682,   685,   696,
         700,   702,   703,   732,   743,   750,   751,   754,   771,
         776,   800,   805,   808,   850,   851,   871,   900,   901,
         902,   903,   907,   909,   923,   950,   952,   953,   960,
         998,   999,  1000,  1001,  1002,  1003,  1004,  1005,  1011,
        1012,  1016,  1017,  1021,  1024,  1054,  1055,  1056,  1062,
        1079,  1082,  1090,  1100,  1102,  1103,  1105,  1111,  1115,
        1120,  1149,  1151,  1158,  1160,  1162,  1164,  1199,  1200,
        1202,  1221,  1223,  1230,  1239,  1240,  1241,  1250,  1252,
        1256,  1275,  1288,  1299,  1300,  1312,  1317,  1324,  1339,
        1351,  1360,  1362,  1363,  1367,  1390,  1394,  1398,  1399,
        1400,  1401,  1403,  1405,  1416,  1432,  1433,  1436,  1500,
        1501,  1502,  1503,  1506,  1521,  1548,  1595,  1596,  1597,
        1598,  1600,  1625,  1631,  1653,  1659,  1689,  1700,  1701,
        1703,  1704,  1707,  1753,  1771,  1779,  1780,  1781,  1783,
        1793,  1796,  1799,  1800,  1801,  1870,  1895,  1896,  1900,
        1910,  1920,  1922,  1933,  1937,  1968,  1986,  1988,  1992,
        1993,  1995,  1998,  1999,  2000,  2004,  2005,  2007,  2009,
        2016,  2017,  2018,  2172,  2200,  2201,  2331,  2340,  2389,
        2402,  2461,  2598,  2729,  2789,  2792,  2799,  3000,  3199,
        3454,  3500,  3750,  4400,  4507,  4700,  5000,  5411,  5420,
        5575,  5809,  5815,  5867,  6006,  6010,  6011,  6012,  6018,
        6045,  6062,  6226,  6512,  6920,  7508,  7511,  7512,  7515,
        7518,  7529,  7544,  8011,  8259,  8404,  8500,  9000,  9007,
        9010,  9011,  9012,  9013,  9710, 10000, 10110, 10218, 10311,
       10317, 10520, 10522, 10710, 10910, 10912, 11011, 11025, 11111,
       11509, 11530, 11635, 12012, 12510, 12512, 12684, 13616, 13636,
       14009, 15001, 15016, 15017, 15020, 15033, 16011, 16051, 16311,
       16312, 17011, 17019, 17410, 17700, 17932, 19208, 19211, 19312,
       20000])

The picture is messy, but the zeros clearly have to be replaced with something.

hist_from_series(series.loc[(series.quantile(0.001) < series) & 
                            (series < series.quantile(0.999))])

I think everything above roughly 400 hp is implausible and will have to be handled as well.
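One hedged way to handle both problems: treat zeros and implausibly high readings as missing, then impute the median of the plausible range. The 400 hp threshold is a judgment call, not something the data dictates. A sketch on a toy series:

```python
import pandas as pd

power = pd.Series([0, 75, 150, 20000, 101, 0], name='Power')

# Mask implausible readings (zero or above the chosen 400 hp cutoff)...
cleaned = power.mask((power == 0) | (power > 400))
# ...then impute the median of the remaining plausible values
cleaned = cleaned.fillna(cleaned.median())
print(cleaned.tolist())
```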

Mileage

series = data['Kilometer']
series.value_counts()
150000    238209
125000     36454
100000     14882
90000      11567
80000      10047
70000       8593
60000       7444
5000        6397
50000       6232
40000       4911
30000       4436
20000       3975
10000       1222
Name: Kilometer, dtype: int64
hist_from_series(series)

The numbers look hand-picked rather than measured; the seller probably chose mileage from a fixed list of options. I will have to check their correlation with the price before training the model on them.

Number of photos

series = data['NumberOfPictures']
series.value_counts()
0    354369
Name: NumberOfPictures, dtype: int64

That column should be dropped altogether (every row is zero), unless we can get real data elsewhere.

Price dynamics

downloaded = pd.to_datetime(data['DateCrawled'], format='%Y-%m-%d %H:%M:%S')
created = pd.to_datetime(data['DateCreated'], format='%Y-%m-%d %H:%M:%S')
visited = pd.to_datetime(data['LastSeen'], format='%Y-%m-%d %H:%M:%S')
downloaded.dt.round('d')
0        2016-03-24
1        2016-03-24
2        2016-03-15
3        2016-03-18
4        2016-04-01
            ...    
354364   2016-03-21
354365   2016-03-15
354366   2016-03-06
354367   2016-03-20
354368   2016-03-21
Name: DateCrawled, Length: 354369, dtype: datetime64[ns]

Looks like about a month is covered.

sb.lineplot(x=downloaded.dt.round('d'), y=data['Price'])
plt.xticks(rotation=30)
plt.show()

It's not clear to me how the day the listing was crawled could affect the price, but judging by the graph there is some dependence.

sb.lineplot(x=created, y=data['Price'])
plt.xticks(rotation=30)
plt.show()

The fluctuations are most likely due to the number of offers.

created.dt.round('d').hist(bins=50)
plt.xticks(rotation=30)
plt.show()
created.loc[created > '2016-03-01'].dt.round('d').hist()
plt.xticks(rotation=30)
plt.show()

In principle, the period before March 2016 can be dropped; it does not reflect the current market anyway.

sb.lineplot(x=created.loc[created > '2016-03-01'], y=data['Price'])
plt.xticks(rotation=30)
plt.show()

I am not sure whether there is a real relationship here.

sb.lineplot(x=visited.dt.round('d'), y=data['Price'])
plt.xticks(rotation=30)
plt.show()

The day of last activity, however, does seem to have an influence.

Duplicates

data.duplicated().sum()
4
data[data.duplicated()]
DateCrawled Price VehicleType RegistrationYear Gearbox Power Model Kilometer RegistrationMonth FuelType Brand Repaired DateCreated NumberOfPictures PostalCode LastSeen
171088 2016-03-08 18:42:48 1799 coupe 1999 auto 193 clk 20000 7 petrol mercedes_benz no 2016-03-08 00:00:00 0 89518 2016-03-09 09:46:57
231258 2016-03-28 00:56:10 1000 small 2002 manual 83 other 150000 1 petrol suzuki no 2016-03-28 00:00:00 0 66589 2016-03-28 08:46:21
258109 2016-04-03 09:01:15 4699 coupe 2003 auto 218 clk 125000 6 petrol mercedes_benz yes 2016-04-03 00:00:00 0 75196 2016-04-07 09:44:54
325651 2016-03-18 18:46:15 1999 wagon 2001 manual 131 passat 150000 7 gasoline volkswagen no 2016-03-18 00:00:00 0 36391 2016-03-18 18:46:15

Nothing abnormal. Drop them and move on.

Data preprocessing

Let's do it in order:

  1. Renaming of the columns
  2. Removing duplicates
  3. Data type conversion
  4. Work with missing values
  5. Work with anomalous values

Renaming

data.rename(columns={
    'DateCrawled': 'downloaded', 
    'Price': 'price', 
    'VehicleType': 'vehicle_type', 
    'RegistrationYear': 'year', 
    'Gearbox': 'gearbox', 
    'Power': 'power', 
    'Model': 'model', 
    'Kilometer': 'km', 
    'RegistrationMonth': 'month', 
    'FuelType': 'fuel_type', 
    'Brand': 'brand', 
    'Repaired': 'repaired', 
    'DateCreated': 'created', 
    'NumberOfPictures': 'pictures', 
    'PostalCode': 'postal_code', 
    'LastSeen': 'visited'
}, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   downloaded    354369 non-null  object
 1   price         354369 non-null  int64 
 2   vehicle_type  316879 non-null  object
 3   year          354369 non-null  int64 
 4   gearbox       334536 non-null  object
 5   power         354369 non-null  int64 
 6   model         334664 non-null  object
 7   km            354369 non-null  int64 
 8   month         354369 non-null  int64 
 9   fuel_type     321474 non-null  object
 10  brand         354369 non-null  object
 11  repaired      283215 non-null  object
 12  created       354369 non-null  object
 13  pictures      354369 non-null  int64 
 14  postal_code   354369 non-null  int64 
 15  visited       354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB

Deduplication

data.duplicated().sum()
4
data = data.drop_duplicates()

No more clear duplicates.

len(data)
354365
len(data[data['downloaded'].duplicated()])
83191
len(data[data.duplicated()])
0

There are repeats among the dates, but those rows differ in other columns, so they are not duplicates.

Data types

data['downloaded'] = pd.to_datetime(data['downloaded'], format='%Y-%m-%d %H:%M:%S')
data['created'] = pd.to_datetime(data['created'], format='%Y-%m-%d %H:%M:%S')
data['visited'] = pd.to_datetime(data['visited'], format='%Y-%m-%d %H:%M:%S')

There is also the issue of the postal code. We can't use it as a quantitative variable, since its relationship with price is clearly non-linear. It would be nice to convert it into districts, but pgeocode offers several candidate countries per code and runs very slowly.

Missing values

data.isna().sum()
downloaded          0
price               0
vehicle_type    37490
year                0
gearbox         19833
power               0
model           19705
km                  0
month               0
fuel_type       32895
brand               0
repaired        71154
created             0
pictures            0
postal_code         0
visited             0
dtype: int64

Let's see how often the missing values match across rows.

with_na = data[['vehicle_type', 'gearbox', 'model', 'fuel_type', 'repaired']]
with_na.isna()
vehicle_type gearbox model fuel_type repaired
0 True False False False True
1 False False True False False
2 False False False False True
3 False False False False False
4 False False False False False
... ... ... ... ... ...
354364 True False False False False
354365 True True True True True
354366 False False False False False
354367 False False False False False
354368 False False False False True

354365 rows × 5 columns

with_na.isna().sum(axis=1).value_counts()
0    245810
1     64318
2     26209
3     10008
4      5783
5      2237
dtype: int64

All 5 columns play a significant role in determining the price. I think we can drop the rows with missing values in more than 2 columns; that's about 5% of the data.

data = data[data.isna().sum(axis=1) < 3]
data.isna().sum()
downloaded          0
price               0
vehicle_type    22122
year                0
gearbox          8954
power               0
model           12016
km                  0
month               0
fuel_type       18540
brand               0
repaired        55104
created             0
pictures            0
postal_code         0
visited             0
dtype: int64
  • For unknown body type — replace with other
  • For unknown type of gearbox — create class unknown
  • For unknown model — replace with other (at least the brand will be known)
  • For unknown fuel type — replace with other
  • For unknown breakdown history — replace with unknown (it is not clear whether we can trust and rely on this data).
data['vehicle_type'] = data['vehicle_type'].fillna('other')
data['gearbox'] = data['gearbox'].fillna('unknown')
data['model'] = data['model'].fillna('other')
data['fuel_type'] = data['fuel_type'].fillna('other')
data['repaired'] = data['repaired'].fillna('unknown')
data.isna().sum()
downloaded      0
price           0
vehicle_type    0
year            0
gearbox         0
power           0
model           0
km              0
month           0
fuel_type       0
brand           0
repaired        0
created         0
pictures        0
postal_code     0
visited         0
dtype: int64

Anomalies

So far, I've noticed the following outliers:

  • About 11,000 rows with zero price. This is hardly an anomaly, since the seller may have planned to bargain with the buyer. Nevertheless, these rows won't help us and should be thrown out.
  • Thirteen months in the year. Most likely, unknown months were coded as month zero. It would be a pity to throw out almost 40,000 rows, and there is nothing to replace them with, so we can leave it as is and train the model on 13 months.
  • A prevailing number of never-repaired cars (with 20% missing values, at that). There is no way to fix this, so it may be best not to use this column for training at all.
  • Registration years before the 20th century and after 2019. The cutoffs will be chosen somewhat arbitrarily.
  • Zero and implausibly high engine horsepower.
  • Implausible mileage values. The lion's share of cars show exactly 150,000 kilometers. This may have been one of the options offered to the seller, so mileage may effectively be a categorical variable.
  • Zero photos for all items. This column should simply be dropped; it carries no information.
  • Listings created before March 5, 2016 are hardly anomalous, but they fall outside the trend, so they should not be considered either.
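The rules above are applied step by step in the cells that follow; purely as a compact sketch (assuming this dataset's column names and the thresholds chosen below), they amount to:

```python
import pandas as pd

def clean_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the outlier rules from the list above in one pass."""
    df = df[df['price'] > 0]                               # zero-price listings carry no signal
    df = df[(df['year'] >= 1950) & (df['year'] <= 2016)]   # plausible registration years
    df = df.drop(columns=['pictures'])                     # constant column, uninformative
    df = df[df['created'] >= '2016-03-05']                 # drop the out-of-trend early listings
    return df
```

The power and price-floor corrections are handled separately below, since their thresholds come out of inspecting the value counts first.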
data = data.drop(data[data['price'] == 0].index)
np.sort(data['year'].unique())
array([1000, 1400, 1600, 1602, 1910, 1923, 1927, 1928, 1929, 1930, 1931,
       1932, 1933, 1934, 1935, 1936, 1937, 1938, 1940, 1941, 1942, 1943,
       1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954,
       1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
       1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976,
       1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2900,
       3700, 4000, 5000, 6000, 6500])
np.array(data['year'].value_counts().sort_index())
array([    1,     1,     1,     1,    16,     2,     1,     1,     7,
           2,     1,     3,     2,     3,     3,     3,     8,     7,
           2,     2,     2,     4,     1,     4,     1,     1,     2,
           1,    17,    10,     8,    12,     8,    12,    22,    11,
          20,    29,    57,    36,    37,    48,    56,    74,   102,
          91,   102,   114,   156,   154,   202,   160,   163,   135,
         151,   162,   265,   259,   287,   239,   295,   395,   375,
         518,   486,   616,   882,  1210,  1983,  2554,  2806,  3220,
        4537,  7933, 10226, 13794, 16951, 21742, 20501, 19482, 18593,
       19387, 19225, 20278, 19460, 16843, 15119, 14271, 10584,  9763,
        7073,  4051,  2472,  1091,  6313,  7189,  2780,    14,     1,
           1,     1,     5,     3,     1])
data = data.loc[(1950 <= data['year']) & (data['year'] <= 2016)]
data['power'].value_counts().sort_index()
0        25996
1           20
2            8
3            8
4           26
         ...  
17932        1
19208        1
19211        1
19312        1
20000        1
Name: power, Length: 678, dtype: int64

Let's replace the zeros with the median.

median = data['power'].median()
data.loc[data['power'] == 0, 'power'] = median
np.array(data['power'].value_counts().sort_index().index).astype(int)
array([    1,     2,     3,     4,     5,     6,     7,     8,     9,
          10,    11,    12,    13,    14,    15,    16,    17,    18,
          19,    20,    21,    22,    23,    24,    25,    26,    27,
          28,    29,    30,    31,    32,    33,    34,    35,    36,
          37,    38,    39,    40,    41,    42,    43,    44,    45,
          46,    47,    48,    49,    50,    51,    52,    53,    54,
          55,    56,    57,    58,    59,    60,    61,    62,    63,
          64,    65,    66,    67,    68,    69,    70,    71,    72,
          73,    74,    75,    76,    77,    78,    79,    80,    81,
          82,    83,    84,    85,    86,    87,    88,    89,    90,
          91,    92,    93,    94,    95,    96,    97,    98,    99,
         100,   101,   102,   103,   104,   105,   106,   107,   108,
         109,   110,   111,   112,   113,   114,   115,   116,   117,
         118,   119,   120,   121,   122,   123,   124,   125,   126,
         127,   128,   129,   130,   131,   132,   133,   134,   135,
         136,   137,   138,   139,   140,   141,   142,   143,   144,
         145,   146,   147,   148,   149,   150,   151,   152,   153,
         154,   155,   156,   157,   158,   159,   160,   161,   162,
         163,   164,   165,   166,   167,   168,   169,   170,   171,
         172,   173,   174,   175,   176,   177,   178,   179,   180,
         181,   182,   183,   184,   185,   186,   187,   188,   189,
         190,   191,   192,   193,   194,   195,   196,   197,   198,
         199,   200,   201,   202,   203,   204,   205,   206,   207,
         208,   209,   210,   211,   212,   213,   214,   215,   216,
         217,   218,   219,   220,   221,   222,   223,   224,   225,
         226,   227,   228,   229,   230,   231,   232,   233,   234,
         235,   236,   237,   238,   239,   240,   241,   242,   243,
         244,   245,   246,   247,   248,   249,   250,   251,   252,
         253,   254,   255,   256,   257,   258,   259,   260,   261,
         262,   264,   265,   266,   267,   268,   269,   270,   271,
         272,   273,   274,   275,   276,   277,   278,   279,   280,
         281,   282,   283,   284,   285,   286,   287,   288,   289,
         290,   292,   293,   294,   295,   296,   297,   298,   299,
         300,   301,   303,   304,   305,   306,   307,   308,   309,
         310,   311,   313,   314,   315,   316,   317,   318,   320,
         321,   322,   323,   324,   325,   326,   328,   329,   330,
         331,   332,   333,   334,   335,   336,   337,   338,   339,
         340,   341,   343,   344,   345,   346,   347,   348,   349,
         350,   351,   352,   353,   354,   355,   356,   357,   358,
         360,   361,   362,   363,   364,   365,   367,   368,   370,
         371,   374,   375,   376,   377,   379,   380,   381,   382,
         385,   386,   387,   388,   390,   392,   394,   396,   398,
         399,   400,   401,   405,   408,   409,   411,   416,   420,
         421,   425,   426,   428,   430,   431,   435,   440,   442,
         445,   449,   450,   454,   457,   459,   460,   475,   476,
         487,   489,   490,   500,   504,   505,   507,   508,   510,
         514,   515,   517,   519,   520,   521,   525,   540,   541,
         544,   550,   551,   553,   560,   572,   579,   580,   584,
         585,   600,   601,   602,   603,   604,   606,   612,   620,
         640,   645,   650,   651,   671,   678,   682,   685,   696,
         700,   702,   703,   732,   743,   750,   751,   754,   771,
         776,   800,   805,   850,   851,   871,   900,   902,   903,
         907,   909,   923,   950,   952,   953,   998,   999,  1000,
        1001,  1002,  1003,  1004,  1005,  1011,  1012,  1016,  1017,
        1021,  1024,  1054,  1055,  1056,  1062,  1079,  1082,  1090,
        1100,  1105,  1115,  1120,  1149,  1151,  1158,  1160,  1162,
        1164,  1199,  1200,  1202,  1221,  1223,  1230,  1239,  1240,
        1241,  1250,  1275,  1288,  1299,  1300,  1312,  1317,  1324,
        1339,  1360,  1362,  1363,  1367,  1390,  1398,  1399,  1400,
        1401,  1403,  1405,  1416,  1432,  1433,  1436,  1500,  1501,
        1502,  1503,  1506,  1521,  1548,  1595,  1596,  1597,  1598,
        1600,  1625,  1631,  1653,  1659,  1689,  1700,  1701,  1703,
        1704,  1707,  1753,  1771,  1779,  1780,  1781,  1783,  1793,
        1796,  1799,  1800,  1801,  1870,  1895,  1896,  1900,  1920,
        1922,  1933,  1968,  1988,  1992,  1993,  1995,  1998,  1999,
        2000,  2004,  2005,  2007,  2009,  2016,  2017,  2018,  2172,
        2201,  2331,  2340,  2389,  2402,  2461,  2598,  2729,  2789,
        2792,  2799,  3199,  3454,  3500,  3750,  4400,  4507,  4700,
        5411,  5420,  5575,  5809,  5815,  5867,  6006,  6010,  6011,
        6012,  6018,  6045,  6062,  6226,  6512,  6920,  7508,  7511,
        7512,  7515,  7518,  7529,  7544,  8259,  8404,  8500,  9007,
        9012,  9013, 10110, 10218, 10311, 10520, 10522, 10710, 10910,
       10912, 11011, 11025, 11111, 11509, 11530, 11635, 12012, 12510,
       12512, 12684, 13636, 14009, 15001, 15017, 15020, 15033, 16051,
       16311, 16312, 17011, 17019, 17410, 17700, 17932, 19208, 19211,
       19312, 20000])
np.array(data['power'].value_counts().sort_index())
array([   20,     8,     8,    26,    93,    10,    10,     5,     4,
          10,    26,    12,     6,    18,    13,     6,     7,    47,
          10,    19,     5,     4,    37,    25,    14,   236,    37,
           5,    33,    28,     8,     3,    47,   167,    10,     6,
          43,    15,   125,   284,   419,    45,   104,   354,  2645,
          52,    29,    80,    22,  4187,    80,   288,   183,  5340,
        1804,   249,   125,  3586,   211, 14383,   957,    88,   386,
        1428,  2016,   169,   423,  2665,  3526,   915,  1379,   447,
         732,   341, 21644,   231,   846,   183,   132,  3816,   116,
        2180,   354,   911,   503,  2596,   757,  1382,    92, 11479,
         113,   427,    25,   379,  2834,   118,   506,   945,   584,
        1289, 12197,  6032,   886,   125, 35776,   387,  1238,   145,
        4564,  4396,   520,   177,   878,   852,  2829, 11273,   202,
        1153,    43,  3647,   352,  4924,    30,   206,  6405,   508,
          26,   385,  1406,   785,  4888,   430,   592,    54,   811,
        6350,    74,    46,   266, 12233,   693,    72,  5126,   651,
         977,   126,  1114,    77,    57, 13555,   210,   103,     6,
          62,   923,   791,    22,   196,    14,  2285,    30,    41,
        5849,    49,   846,   228,    98,    29,   108,  9594,   105,
          29,   194,  1191,   335,    48,  3272,    89,  2078,   624,
          25,    95,    21,  1575,   567,    47,    11,    84,    38,
        1309,    20,  1460,  2432,   144,    34,    69,  1093,    35,
          42,  1905,    88,     6,   111,  1664,   172,    10,   193,
          40,   166,    71,   901,    44,    37,    11,    61,    16,
          13,  2165,    12,   626,     5,    25,    46,  1957,   117,
         167,    11,    37,     3,   332,  1907,   691,    57,    40,
         617,     9,    25,    97,   607,    93,   267,    10,    22,
           4,   357,     7,    17,     1,    34,   726,     2,    67,
          13,    24,     9,   152,     4,   174,     5,   225,     4,
           8,    37,   425,     8,     2,    11,    12,    35,    38,
         726,    20,     9,    89,    18,     9,    12,   194,   231,
           8,     1,     7,    31,    14,   556,     9,     5,     6,
          56,    59,     2,     3,     4,     4,     1,    19,   178,
         194,     9,    11,     5,    72,   650,     4,     3,    12,
          77,    11,   102,    14,    19,     2,     1,     6,   108,
          14,    11,     1,     5,    25,   139,     6,    24,    44,
           1,     1,   166,    82,     9,     2,     1,     4,     1,
         205,     4,    46,   103,    34,     3,    30,     2,    64,
          28,     1,     3,     1,    66,    12,     1,     1,     2,
          41,     1,     3,    15,     5,     4,    75,     3,     7,
           2,     3,     4,     2,     1,     5,     7,     6,     2,
          15,     5,    54,     6,    10,     1,    15,     7,     1,
           4,    47,     4,     5,    10,     1,     2,     1,    22,
           1,     1,     2,     1,     3,     4,    34,     3,     1,
          11,     1,    81,     1,     2,     1,     1,     1,    31,
           2,     1,     1,    30,     2,     1,     9,     1,     4,
           4,     3,     2,     1,     2,     5,     1,     1,     1,
           1,     2,     2,     1,     1,     1,     3,     1,     1,
           2,     9,     3,     1,     5,     1,     1,     1,     2,
           1,     1,     2,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     3,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     2,     1,     1,
           1,     2,     1,     2,     1,     1,     1,     2,     3,
           6,     1,     2,     1,     1,     3,     1,     2,     1,
           2,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     2,     1,     1,     1,
           1,     1,     1,     2,     1,     2,     1,     2,     5,
           2,     2,     1,     1,     1,     1,     1,     4,     1,
           1,     2,     1,     1,     1,     2,     1,     1,     7,
           3,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     2,     1,     2,     1,     2,     1,     1,
           1,     1,     6,     1,     1,     1,     1,     2,     1,
           1,     1,     1,     1,     1,     1,     1,     2,     1,
           2,     1,     1,     1,     1,     2,     1,     2,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     2,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     2,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     2,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     2,
           2,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1,     1,     1,     1,     1,     1,     1,     1,
           1,     1])
data['power'].value_counts()[data['power'].value_counts() == 2645].index[0]
45.0

I'll take this power (45 hp) as the minimum plausible value.

data = data.loc[45 <= data['power']]
data.loc[data['power'] >= 4500, 'power'] = data.loc[data['power'] >= 4500, 'power'] / 10
data.loc[data['power'] >= 450, 'power'] = data.loc[data['power'] >= 450, 'power'] / 10
data['power'] = data['power'].astype(int)
data['power'].value_counts().sort_index()
45     2731
46       53
47       63
48       83
49       23
       ... 
435      34
440       4
442       1
445      11
449       1
Name: power, Length: 355, dtype: int64
data['power'].hist()
plt.show()

That distribution looks much more plausible.

barplot_from_counts(data['km'].value_counts())

I'll transform this into a categorical feature later.
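A minimal sketch of that later conversion (the exact encoding is a modelling choice): since km takes only a small fixed set of values, it can simply be cast to a pandas categorical.

```python
import pandas as pd

# 'km' takes only a handful of fixed values (5,000 ... 150,000), so it can be
# treated as a categorical rather than a continuous number.
km = pd.Series([150000, 5000, 125000, 150000])
km_cat = km.astype('category')
print(list(km_cat.cat.categories))  # [5000, 125000, 150000]
```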

data.drop('pictures', axis=1, inplace=True)

Uninformative feature removed.

data[data['created'] < '2016-03-05']
downloaded price vehicle_type year gearbox power model km month fuel_type brand repaired created postal_code visited
67 2016-03-21 12:47:55 2100 suv 2000 manual 105 other 150000 8 petrol honda no 2016-02-09 85244 2016-03-31 16:46:09
352 2016-03-05 14:14:19 3900 coupe 2004 manual 192 rx_reihe 125000 7 petrol mazda no 2016-03-03 57520 2016-03-29 05:18:01
581 2016-03-14 17:50:40 1990 bus 2003 manual 90 other 150000 10 gasoline ford no 2016-03-01 82467 2016-03-21 10:45:42
768 2016-03-05 14:25:04 1300 wagon 2001 manual 75 golf 125000 4 petrol volkswagen no 2016-03-04 65929 2016-03-06 11:11:24
856 2016-03-06 20:48:17 11200 small 2014 manual 105 other 30000 0 petrol skoda no 2016-03-01 41844 2016-04-07 02:45:38
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
353659 2016-03-05 14:08:59 1799 bus 2004 manual 75 combo 150000 5 gasoline opel no 2016-03-04 55481 2016-04-05 10:46:51
353856 2016-03-05 14:10:01 6300 sedan 2007 auto 116 astra 70000 9 petrol opel no 2016-03-03 63263 2016-03-28 22:18:58
354047 2016-03-05 14:18:10 1000 sedan 1993 manual 75 golf 150000 2 petrol volkswagen no 2016-03-04 35767 2016-03-06 04:45:43
354109 2016-03-05 14:23:04 17890 wagon 2011 auto 160 insignia 100000 6 gasoline opel no 2016-03-04 65321 2016-04-06 10:44:35
354179 2016-03-05 14:17:32 3200 sedan 2003 manual 102 a_klasse 150000 4 petrol mercedes_benz no 2016-03-04 35428 2016-03-15 09:47:14

1445 rows × 15 columns

data = data[data['created'] >= '2016-03-05']
data['price'].value_counts().sort_index().head(20)
1     756
2       7
3       4
5      15
7       2
9       5
10     38
11      1
12      2
13      3
15      3
17      1
20     11
24      1
25      6
26      1
29      1
30     17
35      6
38      1
Name: price, dtype: int64
data = data.drop(data[data['price'] < 35].index)

Exploratory data analysis

At this stage, we need to identify possible relationships between attributes and select the attributes for training. Since the target feature is price, we look first of all at relationships with it.

Since some operations take a long time to run on 350,000 rows, let's study a 10% sample (about 31,000 objects).

sample = data.sample(frac=0.1, random_state=57)
sample.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31350 entries, 262452 to 225675
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   downloaded    31350 non-null  datetime64[ns]
 1   price         31350 non-null  int64         
 2   vehicle_type  31350 non-null  object        
 3   year          31350 non-null  int64         
 4   gearbox       31350 non-null  object        
 5   power         31350 non-null  int64         
 6   model         31350 non-null  object        
 7   km            31350 non-null  int64         
 8   month         31350 non-null  int64         
 9   fuel_type     31350 non-null  object        
 10  brand         31350 non-null  object        
 11  repaired      31350 non-null  object        
 12  created       31350 non-null  datetime64[ns]
 13  postal_code   31350 non-null  int64         
 14  visited       31350 non-null  datetime64[ns]
dtypes: datetime64[ns](3), int64(6), object(6)
memory usage: 3.8+ MB

Period

Let's look at the dates the proposals were created.

sample['created'].dt.round('d').hist()
plt.xticks(rotation=30)
plt.show()
sb.lineplot(x=sample['created'].dt.round('d'), y=sample['price'], label='created')
plt.xticks(rotation=30)
plt.legend()
plt.show()

The price didn't change too much during that time. I think this column can be dropped.

sb.lineplot(x=sample['downloaded'].dt.round('d'), y=sample['price'], label='downloaded')
sb.lineplot(x=sample['visited'].dt.round('d'), y=sample['price'], label='visited')
plt.xticks(rotation=30)
plt.legend()
plt.show()

The time of download is unlikely to affect the price, but the date of last visit may well.

Let's convert it to seconds on a small sample and use it as a quantitative variable.

%%time
subsample = data.sample(frac=0.01, random_state=57)
seconds = subsample['visited'].apply(lambda x: 
                                     round((max(subsample['visited']) - x).total_seconds()))
CPU times: user 9.26 s, sys: 5.46 ms, total: 9.27 s
Wall time: 9.25 s
seconds.corr(subsample['price'])
-0.12227655531259655

There is a weak negative correlation.
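The elementwise apply() above is slow (~9 s for 1% of the data); the same quantity can be computed with vectorized datetime subtraction, which gives identical values in a fraction of the time. A sketch on synthetic timestamps:

```python
import pandas as pd

visited = pd.Series(pd.to_datetime(['2016-03-05 10:00:00',
                                    '2016-03-06 10:00:00',
                                    '2016-03-07 10:00:00']))
# Vectorized equivalent of the apply() above:
# seconds elapsed between each visit and the latest one.
seconds = (visited.max() - visited).dt.total_seconds().round().astype(int)
print(seconds.tolist())  # [172800, 86400, 0]
```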

Postal code

The situation is similar with the postal code. Let's try to evaluate how meaningful it is to transform it into locations.

eu_countries = ['AD', 'AT', 'AX', 'BE', 'BG', 'CH', 'CZ', 
                'DE', 'DK', 'ES', 'FI', 'FO', 'FR', 'GB', 
                'GG', 'GL', 'HR', 'HU', 'IE', 'IM', 'IT', 
                'JE', 'LI', 'LT', 'LU', 'LV', 'MC', 'MD', 
                'MK', 'MT', 'NL', 'NO', 'PL', 'PM', 'PT', 
                'RO', 'SE', 'SI', 'SJ', 'SK', 'SM', 'VA']

Of the 324 cases, 246 are in Germany, and another 52 are ambiguous between Germany and some other country; 3 codes were not found in any of the European countries.

Within Germany, the classification is already more meaningful. However, it is still unclear which country to choose in some cases.

In general, let's not use this information for training; besides, it takes a very long time to compute.
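If one nevertheless wanted a cheap, offline location signal, the first digit of a German Postleitzahl encodes one of ten broad postal zones. A sketch (assumes five-digit German codes stored as integers, which lose their leading zeros):

```python
import pandas as pd

postal = pd.Series([85244, 57520, 10115, 4109])  # 4109 is really '04109'
# Restore the leading zero, then take the first digit: the postal zone (0-9).
zone = postal.astype(str).str.zfill(5).str[0].astype(int)
print(zone.tolist())  # [8, 5, 1, 0]
```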

Vehicle body type

barplot_from_counts(sample['vehicle_type'].value_counts())
vehicle_stat = (sample
                .groupby('vehicle_type')['price']
                .agg(['count', 'mean', 'median'])
                .sort_values(by='count', ascending=False))
vehicle_stat
count mean median
vehicle_type
sedan 8744 4897.866651 3100
small 7491 2770.428247 1600
wagon 6372 5154.241996 3500
bus 2766 5489.066522 4200
convertible 1962 6910.125382 5500
coupe 1511 6261.535407 4250
other 1336 2909.621257 1450
suv 1168 9375.623288 8900
plt.bar(x=vehicle_stat.index, height=vehicle_stat['mean'], label='mean')
plt.bar(x=vehicle_stat.index, height=vehicle_stat['median'], label='median')
plt.legend()
plt.show()

In general, we can say that the rarer a car is, the more sellers ask for it.

Registration year

data['year'].unique()
array([1993, 2011, 2004, 2001, 2008, 1995, 2014, 1998, 2005, 2007, 2009,
       2002, 1997, 1990, 1981, 2003, 2016, 1994, 1991, 1984, 2006, 1999,
       2012, 2010, 1992, 2013, 2000, 1996, 1989, 2015, 1985, 1976, 1983,
       1969, 1971, 1987, 1973, 1986, 1982, 1988, 1980, 1965, 1970, 1974,
       1979, 1978, 1972, 1968, 1977, 1961, 1966, 1975, 1963, 1958, 1967,
       1964, 1960, 1951, 1959, 1953, 1956, 1962, 1950, 1955, 1952, 1957,
       1954])
year_stat = (sample
             .groupby('year')['price']
             .agg(['count', 'mean', 'median'])
             .sort_index())
year_stat
count mean median
year
1950 2 19250.000000 19250.0
1952 1 3800.000000 3800.0
1953 2 1874.500000 1874.5
1955 4 15337.500000 15850.0
1956 1 2990.000000 2990.0
... ... ... ...
2012 736 12225.645380 12050.0
2013 398 12892.635678 12900.0
2014 253 13220.869565 13700.0
2015 102 12784.882353 14030.0
2016 607 2670.039539 1200.0

65 rows × 3 columns

plt.plot(year_stat.index, year_stat['count'])
plt.title('Proposals')
plt.show()
plt.plot(year_stat.index, year_stat['mean'], label='mean')
plt.plot(year_stat.index, year_stat['median'], label='median')
plt.legend()
plt.show()

For the bulk of the data, there is a clear upward trend in price with newer registration years.

Gearbox

gearbox_stat = (sample
             .groupby('gearbox')['price']
             .agg(['count', 'mean', 'median'])
             .sort_index())
gearbox_stat
count mean median
gearbox
auto 6190 7245.962682 5950
manual 24345 4196.109427 2650
unknown 815 3013.309202 2250

Apparently, the price depends on whether the gearbox is automatic.

Engine power

power_stat = (sample
             .groupby('power')['price']
             .agg(['count', 'mean', 'median'])
             .sort_index())
power_stat
count mean median
power
45 268 1666.029851 699.5
46 10 4305.000000 3175.0
47 5 7599.800000 2500.0
48 12 4095.583333 2350.0
49 3 553.000000 499.0
... ... ... ...
420 2 8750.000000 8750.0
431 1 19990.000000 19990.0
435 5 16880.000000 17200.0
442 1 11900.000000 11900.0
445 1 11850.000000 11850.0

285 rows × 3 columns

sample['power'].hist(bins=200)
plt.title('Proposals')
plt.show()
plt.scatter(power_stat.index, power_stat['mean'], label='mean')
plt.scatter(power_stat.index, power_stat['median'], label='median')
plt.legend()
plt.show()
sample['price'].corr(sample['power'])
0.5059295126901395

Correlation, of course, does not imply causation. But there is still a reason to believe that price depends on power.

Car model

len(sample['model'].unique())
245

Let's check how unambiguously the models match the brands.

models_by_brand = sample.groupby('brand')['model'].unique()
models_by_brand
brand
alfa_romeo                      [156, 145, 147, other, spider, 159]
audi              [a4, q7, 90, a6, a3, a5, a8, 80, other, a1, 10...
bmw               [3er, 5er, other, x_reihe, 1er, z_reihe, 7er, ...
chevrolet                      [matiz, aveo, captiva, other, spark]
chrysler          [other, crossfire, voyager, ptcruiser, 300c, g...
citroen                       [c3, c5, other, c4, c2, c1, berlingo]
dacia                        [sandero, logan, duster, lodgy, other]
daewoo                         [matiz, nubira, kalos, other, lanos]
daihatsu          [sirion, terios, cuore, other, move, materia, ...
fiat              [punto, other, 500, bravo, doblo, stilo, seice...
ford              [ka, escort, focus, fiesta, other, mondeo, kug...
honda                        [cr_reihe, other, accord, civic, jazz]
hyundai                       [other, i_reihe, santa, tucson, getz]
jaguar                                      [other, x_type, s_type]
jeep                             [grand, other, cherokee, wrangler]
kia               [carnival, other, sorento, sportage, picanto, ...
lada                                          [niva, other, samara]
lancia            [other, lybra, ypsilon, elefantino, delta, mus...
land_rover        [freelander, range_rover, defender, discovery,...
mazda             [mx_reihe, other, 6_reihe, 3_reihe, rx_reihe, ...
mercedes_benz     [c_klasse, e_klasse, other, vito, s_klasse, a_...
mini                                  [one, cooper, other, clubman]
mitsubishi        [lancer, colt, carisma, other, pajero, galant,...
nissan            [micra, primera, almera, qashqai, other, juke,...
opel              [zafira, astra, meriva, corsa, vivaro, omega, ...
peugeot           [3_reihe, 2_reihe, other, 1_reihe, 4_reihe, 5_...
porsche                              [cayenne, boxster, other, 911]
renault           [twingo, clio, espace, laguna, megane, modus, ...
rover                                [other, discovery, freelander]
saab                                             [other, 900, 9000]
seat              [ibiza, arosa, leon, alhambra, altea, cordoba,...
skoda             [other, fabia, octavia, citigo, superb, yeti, ...
smart                            [fortwo, other, roadster, forfour]
sonstige_autos                                              [other]
subaru                    [legacy, other, impreza, justy, forester]
suzuki                                 [swift, other, grand, jimny]
toyota            [avensis, other, corolla, yaris, rav, auris, a...
trabant                                                [601, other]
volkswagen        [passat, golf, polo, transporter, up, touareg,...
volvo             [v70, c_reihe, other, s60, v50, xc_reihe, v40,...
Name: model, dtype: object
sets = []
for brand_from in models_by_brand.index:
    for brand_to in models_by_brand.index:
        if brand_from != brand_to:
            sets.append(set(models_by_brand[brand_from]) & set(models_by_brand[brand_to]))
# sets
for i in sets:
    if i != {'other'}:
        print(i)
{'matiz', 'other'}
{'grand', 'other'}
{'grand', 'other'}
{'matiz', 'other'}
{'grand', 'other'}
{'grand', 'other'}
{'freelander', 'discovery', 'other'}
{'1_reihe', '3_reihe', 'other', '5_reihe'}
{'1_reihe', '3_reihe', 'other', '5_reihe'}
{'freelander', 'discovery', 'other'}
{'grand', 'other'}
{'grand', 'other'}

By and large, the models don't overlap. The question is whether to use them in training or make do with the brand. I think it is better not to use the model, as it would encourage overfitting.

Mileage

km_stat = (sample.groupby('km')['price'].agg(['count', 'mean', 'median']).sort_index())
km_stat
count mean median
km
5000 331 4160.265861 1700.0
10000 80 11537.625000 12110.5
20000 367 9311.476839 9000.0
30000 418 10108.069378 9900.0
40000 451 9965.481153 9300.0
50000 559 9393.166369 8500.0
60000 754 9143.026525 8100.0
70000 752 8275.559840 7145.0
80000 940 7433.852128 6400.0
90000 1073 7013.849021 5999.0
100000 1332 6306.957207 5000.0
125000 3304 5356.416162 3800.0
150000 20989 3623.120444 2250.0
plt.plot(km_stat.index, km_stat['mean'], label='mean')
plt.plot(km_stat.index, km_stat['median'], label='median')
plt.legend()
plt.show()

It turns out that almost unused cars are valued low. This may be due to the year of manufacture.

km_by_year = sample[sample['km'] == 5000]['year'].value_counts().sort_index()
plt.bar(x=km_by_year.index, height=km_by_year)
plt.title('5000 km')
plt.show()
km_by_year_total = sample['year'].value_counts().sort_index()
plt.bar(x=km_by_year_total.index, height=km_by_year_total)
plt.title('All')
plt.show()

No, it doesn't look like it.

Month

month_stat = (sample
             .groupby('month')['price']
             .agg(['count', 'mean', 'median'])
             .sort_index())
month_stat
count mean median
month
0 2153 2375.229447 1300
1 2048 4733.095703 2999
2 1902 4789.637224 3000
3 3224 4755.536290 2999
4 2729 5059.130817 3400
5 2690 4927.915613 3350
6 2827 4869.239830 3100
7 2506 5104.338787 3500
8 2123 4709.706076 2990
9 2220 5188.923423 3500
10 2407 5014.046115 3400
11 2320 5215.491379 3550
12 2201 4955.707406 3300
plt.plot(month_stat.index, month_stat['mean'], label='mean')
plt.plot(month_stat.index, month_stat['median'], label='median')
plt.title('Price by month of registration')
plt.legend()
plt.show()

It seems this column does not affect the price.

sample['price'].corr(sample['month'])
0.08204929198706952

Apparently not. Let's not train on it.

Fuel type

fuel_stat = (sample
             .groupby('fuel_type')['price']
             .agg(['count', 'mean', 'median'])
             .sort_index())
fuel_stat
count mean median
fuel_type
cng 57 4819.192982 4500
electric 4 10912.500000 10375
gasoline 9350 6892.283850 5699
hybrid 23 9472.130435 10900
lpg 454 4478.830396 3333
other 1481 2882.409183 1500
petrol 19981 3912.784445 2250
x_axis = np.arange(len(fuel_stat))
plt.bar(x=x_axis - 0.2, width=0.3, height=fuel_stat['mean'], label='mean')
plt.bar(x=x_axis + 0.2, width=0.3, height=fuel_stat['median'], label='median')
plt.xticks(x_axis, fuel_stat.index)
plt.title('Price by fuel type')
plt.legend()
plt.show()

There is little data for electric and hybrid-fueled vehicles, but there appears to be a correlation.

Brand

brand_stat = (sample
             .groupby('brand')['price']
             .agg(['count', 'mean', 'median'])
             .sort_values(by='mean'))
brand_stat
count mean median
brand
daewoo 45 977.800000 900.0
rover 36 1480.666667 950.0
daihatsu 61 2081.918033 1300.0
trabant 9 2097.000000 2000.0
renault 1576 2401.462563 1350.0
lancia 37 2486.270270 1300.0
fiat 805 2903.925466 1750.0
opel 3534 3017.632428 1700.0
mitsubishi 264 3332.837121 1600.0
ford 2263 3393.979231 1800.0
peugeot 972 3426.346708 2449.5
chrysler 133 3527.052632 1990.0
subaru 57 3576.000000 1700.0
citroen 421 3770.857482 2650.0
smart 431 3777.895592 3100.0
mazda 539 3794.000000 2500.0
alfa_romeo 213 3824.582160 2400.0
lada 24 4133.625000 3499.5
honda 277 4185.050542 2950.0
suzuki 219 4293.141553 3000.0
seat 616 4425.188312 2750.0
nissan 429 4457.016317 2350.0
saab 44 4657.363636 2800.0
volkswagen 6631 4747.163022 2990.0
toyota 435 4771.425287 3990.0
volvo 279 4807.139785 3100.0
kia 210 5155.023810 3325.0
hyundai 329 5383.458967 3900.0
chevrolet 145 5511.310345 4000.0
dacia 75 5615.960000 4250.0
mercedes_benz 2953 6136.944802 4500.0
skoda 521 6298.978887 5300.0
bmw 3300 6494.579091 5250.0
jaguar 58 6699.724138 5500.0
audi 2736 6731.476243 5500.0
sonstige_autos 185 7300.281081 6000.0
jeep 62 7392.354839 6374.5
mini 299 9378.066890 8550.0
land_rover 68 10315.044118 10375.0
porsche 59 12445.237288 13500.0
x_axis = np.arange(len(brand_stat))
fig, ax = plt.subplots(figsize=(12, 9))
ax.barh(x_axis, brand_stat['mean'], 0.4, label='mean')
ax.barh(x_axis + 0.4, brand_stat['median'], 0.4, label='median')

ax.set(yticks=x_axis + 0.4, yticklabels=brand_stat.index, ylim=[2*0.4 - 1, len(brand_stat)])
ax.legend(fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Price by brand (euros)', fontsize=15)
plt.show()

Not surprisingly, there's a dependency here.

Breakdown history

repair_stat = (sample
             .groupby('repaired')['price']
             .agg(['count', 'mean', 'median', 'std'])
             .sort_values(by='mean'))
repair_stat
count mean median std
repaired
yes 3190 2158.423511 1000 2909.581567
unknown 4807 3133.212191 1800 3558.690219
no 23353 5460.365606 3850 4753.213788

Let's also check whether the price depends on the date the listing was created.

sb.lineplot(x=sample[sample['repaired'] == 'yes']['created'].round('d'), 
            y=sample[sample['repaired'] == 'yes']['price'], label='yes')
sb.lineplot(x=sample[sample['repaired'] == 'no']['created'].round('d'), 
            y=sample[sample['repaired'] == 'no']['price'], label='no')
sb.lineplot(x=sample[sample['repaired'] == 'unknown']['created'].round('d'), 
            y=sample[sample['repaired'] == 'unknown']['price'], label='unknown')
plt.xticks(rotation=30)
plt.legend()
plt.show()

If we plot the standard deviation instead of the confidence interval, the picture is less tidy.

sb.lineplot(x=sample[sample['repaired'] == 'yes']['created'].round('d'), 
            y=sample[sample['repaired'] == 'yes']['price'], label='yes', errorbar='sd')
sb.lineplot(x=sample[sample['repaired'] == 'no']['created'].round('d'), 
            y=sample[sample['repaired'] == 'no']['price'], label='no', errorbar='sd')
sb.lineplot(x=sample[sample['repaired'] == 'unknown']['created'].round('d'), 
            y=sample[sample['repaired'] == 'unknown']['price'], label='unknown', errorbar='sd')
plt.xticks(rotation=30)
plt.legend()
plt.show()

That is expected, though: unrepaired cars from very different price ranges can be sold on the same day, so there is nothing strange here. The repair feature itself is significant, even if the apparent dependence on date is not real.

Feature correlations

corr_matrix = data.corr()
sb.heatmap(corr_matrix, cmap='flare', annot=True)
plt.yticks(rotation=0)
plt.show()

The price correlates with a number of features, which is good news. As for the training features themselves, there is no multicollinearity among them, apart from a slight correlation between year and mileage.
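As an aside, older pandas versions silently drop non-numeric columns in `corr()`, while pandas 2.0 and later raise an error instead. A version-safe sketch on a toy frame (the data and column names here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame mixing numeric and categorical columns (hypothetical data)
df = pd.DataFrame({
    'price': [500, 1200, 3000, 4500],
    'year': [1999, 2004, 2010, 2015],
    'km': [150000, 125000, 90000, 60000],
    'brand': ['opel', 'bmw', 'audi', 'bmw'],  # non-numeric, must be excluded
})

# Restrict to numeric columns explicitly so the call works on any pandas version
corr_matrix = df.select_dtypes('number').corr()
print(corr_matrix.shape)  # 3 numeric columns -> 3x3 matrix
```

With explicit column selection the same call behaves identically across pandas versions.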

Samples

Since ~350,000 objects can take a long time to process, let's take 25% of this data.

For them we also need to:

  1. select the features for training
  2. apply one-hot encoding to the categorical variables
  3. scale the numerical variables
sample = data.sample(frac=0.25, random_state=57)
sample.dtypes
downloaded      datetime64[ns]
price                    int64
vehicle_type            object
year                     int64
gearbox                 object
power                    int64
model                   object
km                       int64
month                    int64
fuel_type               object
brand                   object
repaired                object
created         datetime64[ns]
postal_code              int64
visited         datetime64[ns]
dtype: object

Features for training

features = sample.drop(['downloaded', 
                        'price', 
                        'model', 
                        'month', 
                        'created', 
                        'postal_code'], axis=1)
target = sample['price']
print(features.shape)
print(target.shape)
(78374, 9)
(78374,)
features
vehicle_type year gearbox power km fuel_type brand repaired visited
262452 wagon 1999 manual 125 150000 petrol volkswagen unknown 2016-04-05 12:17:29
323263 wagon 1996 manual 116 150000 petrol bmw no 2016-04-06 15:18:01
139907 small 2009 manual 69 60000 petrol ford unknown 2016-04-06 00:44:44
112881 sedan 2002 manual 163 150000 petrol mercedes_benz no 2016-04-01 16:55:42
72971 wagon 2006 manual 116 150000 gasoline audi no 2016-03-07 15:45:13
... ... ... ... ... ... ... ... ... ...
172674 wagon 2005 manual 116 150000 gasoline bmw no 2016-03-13 01:17:49
162873 convertible 2005 manual 101 150000 petrol opel no 2016-03-12 21:46:56
37884 sedan 2007 manual 224 150000 gasoline mercedes_benz no 2016-04-05 16:46:24
178618 wagon 2002 auto 143 150000 gasoline mercedes_benz unknown 2016-03-27 18:44:43
121151 wagon 1999 manual 110 150000 gasoline volkswagen yes 2016-04-05 13:18:27

78374 rows × 9 columns

features.drop('visited', axis=1, inplace=True)
features
vehicle_type year gearbox power km fuel_type brand repaired
262452 wagon 1999 manual 125 150000 petrol volkswagen unknown
323263 wagon 1996 manual 116 150000 petrol bmw no
139907 small 2009 manual 69 60000 petrol ford unknown
112881 sedan 2002 manual 163 150000 petrol mercedes_benz no
72971 wagon 2006 manual 116 150000 gasoline audi no
... ... ... ... ... ... ... ... ...
172674 wagon 2005 manual 116 150000 gasoline bmw no
162873 convertible 2005 manual 101 150000 petrol opel no
37884 sedan 2007 manual 224 150000 gasoline mercedes_benz no
178618 wagon 2002 auto 143 150000 gasoline mercedes_benz unknown
121151 wagon 1999 manual 110 150000 gasoline volkswagen yes

78374 rows × 8 columns

This is the form in which the data will go into modeling.

Splitting

Let's split the data into training, validation, and test sets. This is necessary even for models tuned with cross-validation, because we will need to run a mini-test on held-out data after parameter selection.

features_train, features_valid, features_test = np.split(features.sample(
    frac=1, random_state=57),[int(0.6*len(features)), int(0.8*len(features))])
target_train, target_valid, target_test = np.split(target.sample(
    frac=1, random_state=57),[int(0.6*len(target)), int(0.8*len(target))])
# .sample(frac=1) shuffles the dataframe; the list gives the indices at which np.split cuts it
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

print(target_train.shape)
print(target_valid.shape)
print(target_test.shape)
(47024, 8)
(15675, 8)
(15675, 8)
(47024,)
(15675,)
(15675,)
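The same 60/20/20 split can be sketched with two calls to scikit-learn's `train_test_split`, which keeps features and target aligned automatically. This is an alternative sketch on toy data, not the split used above; the exact rows selected will differ from the `np.split` version:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for features and target (hypothetical data)
X = pd.DataFrame({'power': np.arange(100), 'km': np.arange(100) * 1000})
y = pd.Series(np.arange(100), name='price')

# First split off the training part (60%), then halve the remainder into 20%/20%
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=57)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=57)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```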
features_train
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 140 150000 gasoline audi no
34735 sedan 2003 manual 150 90000 gasoline mercedes_benz no
313161 other 2000 manual 105 100000 petrol fiat unknown
263346 bus 2002 manual 147 150000 petrol opel no
126860 sedan 1998 manual 101 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
69954 sedan 1991 manual 105 150000 other opel no
278609 coupe 1999 manual 150 150000 petrol bmw no
3433 sedan 2004 manual 105 125000 petrol seat no
43224 sedan 1994 manual 136 125000 petrol mercedes_benz no
86464 small 2006 manual 64 70000 petrol volkswagen no

47024 rows × 8 columns

target_train
159639    6600
34735     8900
313161    1500
263346    2100
126860    1500
          ... 
69954      800
278609    3000
3433      3750
43224     1111
86464     5300
Name: price, Length: 47024, dtype: int64

Scaling

We will scale the engine power values, as well as the time in seconds, if that feature is used.

scaler = StandardScaler()

try:
    scaler.fit(features_train[['power', 'seconds']])
    features_train[['power', 'seconds']] = scaler.transform(
        features_train[['power', 'seconds']])
    features_valid[['power', 'seconds']] = scaler.transform(
        features_valid[['power', 'seconds']])
    features_test[['power', 'seconds']] = scaler.transform(
        features_test[['power', 'seconds']])
except KeyError:  # no 'seconds' column in the features
    scaler.fit(features_train[['power']])
    features_train['power'] = scaler.transform(features_train[['power']])
    features_valid['power'] = scaler.transform(features_valid[['power']])
    features_test['power'] = scaler.transform(features_test[['power']])
features_train
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 0.406812 150000 gasoline audi no
34735 sedan 2003 manual 0.605208 90000 gasoline mercedes_benz no
313161 other 2000 manual -0.287575 100000 petrol fiat unknown
263346 bus 2002 manual 0.545689 150000 petrol opel no
126860 sedan 1998 manual -0.366933 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
69954 sedan 1991 manual -0.287575 150000 other opel no
278609 coupe 1999 manual 0.605208 150000 petrol bmw no
3433 sedan 2004 manual -0.287575 125000 petrol seat no
43224 sedan 1994 manual 0.327453 125000 petrol mercedes_benz no
86464 small 2006 manual -1.100999 70000 petrol volkswagen no

47024 rows × 8 columns
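An alternative to the try/except is to build the column list up front, which makes the intent explicit. A minimal sketch on toy frames, assuming the optional column is named 'seconds' as above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/valid frames; a 'seconds' column may or may not exist (hypothetical data)
train = pd.DataFrame({'power': [75.0, 116.0, 150.0, 224.0]})
valid = pd.DataFrame({'power': [101.0, 140.0]})

# Scale only the numeric columns that are actually present
num_cols = [c for c in ('power', 'seconds') if c in train.columns]

scaler = StandardScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])
valid[num_cols] = scaler.transform(valid[num_cols])  # reuse training statistics

print(train['power'].mean())  # mean of a standardized column is ~0
```

The scaler is still fit on the training split only, so no information leaks into validation or test.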

Samples with one-hot encoding

features_train[['year', 'km']] = features_train[['year', 'km']].astype(str)
features_valid[['year', 'km']] = features_valid[['year', 'km']].astype(str)
features_test[['year', 'km']] = features_test[['year', 'km']].astype(str)
features_valid
vehicle_type year gearbox power km fuel_type brand repaired
304403 other 2012 manual -0.188377 40000 petrol kia unknown
104873 other 2016 manual -0.287575 150000 petrol opel unknown
270484 small 2005 manual -1.180358 125000 petrol ford no
198191 small 1998 manual -0.664528 150000 petrol toyota no
32356 small 2012 manual 0.049699 20000 petrol mini no
... ... ... ... ... ... ... ... ...
16674 sedan 1992 manual 1.081359 150000 petrol audi no
110158 small 2002 manual -0.882763 80000 petrol opel no
227995 wagon 2012 manual -0.545490 60000 gasoline peugeot no
130757 wagon 2012 auto 0.406812 150000 gasoline volkswagen no
47062 wagon 2006 auto 1.002000 150000 gasoline volkswagen no

15675 rows × 8 columns

features_train_ohe = pd.get_dummies(features_train, drop_first=True, 
                                    columns=features_train.columns.drop('power'))
features_valid_ohe = pd.get_dummies(features_valid, drop_first=True, 
                                    columns=features_valid.columns.drop('power'))
features_test_ohe = pd.get_dummies(features_test, drop_first=True, 
                                    columns=features_test.columns.drop('power'))
features_train_ohe
power vehicle_type_convertible vehicle_type_coupe vehicle_type_other vehicle_type_sedan vehicle_type_small vehicle_type_suv vehicle_type_wagon year_1953 year_1955 ... brand_smart brand_sonstige_autos brand_subaru brand_suzuki brand_toyota brand_trabant brand_volkswagen brand_volvo repaired_unknown repaired_yes
159639 0.406812 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
34735 0.605208 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
313161 -0.287575 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
263346 0.545689 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
126860 -0.366933 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
69954 -0.287575 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
278609 0.605208 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3433 -0.287575 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
43224 0.327453 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
86464 -1.100999 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

47024 rows × 132 columns

features_valid_ohe
power vehicle_type_convertible vehicle_type_coupe vehicle_type_other vehicle_type_sedan vehicle_type_small vehicle_type_suv vehicle_type_wagon year_1955 year_1958 ... brand_smart brand_sonstige_autos brand_subaru brand_suzuki brand_toyota brand_trabant brand_volkswagen brand_volvo repaired_unknown repaired_yes
304403 -0.188377 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
104873 -0.287575 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
270484 -1.180358 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
198191 -0.664528 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
32356 0.049699 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16674 1.081359 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
110158 -0.882763 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
227995 -0.545490 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
130757 0.406812 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
47062 1.002000 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0

15675 rows × 128 columns

common_columns = (set(features_train_ohe.columns) & 
                  set(features_valid_ohe.columns) & 
                  set(features_test_ohe.columns))
len(common_columns)
126

The column counts differ because a given split may lack some of the unique category values, so we keep only the columns shared by all three samples.

features_train_ohe = features_train_ohe[list(common_columns)]
features_valid_ohe = features_valid_ohe[list(common_columns)]
features_test_ohe = features_test_ohe[list(common_columns)]
features_train_ohe.shape
(47024, 126)
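Intersecting the column sets works, but it silently drops dummies that appear only in the training split. An alternative sketch: `reindex` the validation/test frames against the training columns, filling absent dummies with zeros, so the training feature space is preserved (toy frames, hypothetical column names):

```python
import pandas as pd

# Toy one-hot frames: valid is missing 'brand_bmw' and has an extra 'brand_saab'
train_ohe = pd.DataFrame({'power': [0.1, -0.2], 'brand_bmw': [1, 0], 'brand_opel': [0, 1]})
valid_ohe = pd.DataFrame({'power': [0.4], 'brand_opel': [1], 'brand_saab': [0]})

# Force valid/test into the training column space; absent dummies become 0
valid_aligned = valid_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(valid_aligned.columns))  # same columns and order as train_ohe
```

Columns unseen at training time (like 'brand_saab' here) are dropped, which is the right behavior: the model cannot use them anyway.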

Samples with ordinal encoding

features_train
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 0.406812 150000 gasoline audi no
34735 sedan 2003 manual 0.605208 90000 gasoline mercedes_benz no
313161 other 2000 manual -0.287575 100000 petrol fiat unknown
263346 bus 2002 manual 0.545689 150000 petrol opel no
126860 sedan 1998 manual -0.366933 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
69954 sedan 1991 manual -0.287575 150000 other opel no
278609 coupe 1999 manual 0.605208 150000 petrol bmw no
3433 sedan 2004 manual -0.287575 125000 petrol seat no
43224 sedan 1994 manual 0.327453 125000 petrol mercedes_benz no
86464 small 2006 manual -1.100999 70000 petrol volkswagen no

47024 rows × 8 columns

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=999)
cat_features = list(features.columns.drop('power'))
encoder.fit(features_train[cat_features])
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=999)
features_train_ord = features_train.copy()
features_valid_ord = features_valid.copy()
features_test_ord = features_test.copy()

features_train_ord[cat_features] = encoder.transform(features_train_ord[cat_features])
features_valid_ord[cat_features] = encoder.transform(features_valid_ord[cat_features])
features_test_ord[cat_features] = encoder.transform(features_test_ord[cat_features])
features_train_ord
vehicle_type year gearbox power km fuel_type brand repaired
159639 7.0 52.0 0.0 0.406812 3.0 2.0 1.0 0.0
34735 4.0 50.0 1.0 0.605208 12.0 2.0 20.0 0.0
313161 3.0 47.0 1.0 -0.287575 1.0 6.0 9.0 1.0
263346 0.0 49.0 1.0 0.545689 3.0 6.0 24.0 0.0
126860 4.0 45.0 1.0 -0.366933 3.0 5.0 38.0 0.0
... ... ... ... ... ... ... ... ...
69954 4.0 38.0 1.0 -0.287575 3.0 5.0 24.0 0.0
278609 2.0 46.0 1.0 0.605208 3.0 6.0 2.0 0.0
3433 4.0 51.0 1.0 -0.287575 2.0 6.0 30.0 0.0
43224 4.0 41.0 1.0 0.327453 2.0 6.0 20.0 0.0
86464 5.0 53.0 1.0 -1.100999 10.0 6.0 38.0 0.0

47024 rows × 8 columns

Training

Let's try linear regression, random forest, LightGBM, CatBoost and XGBoost.

Linear regression

%%time
model = LinearRegression()
model.fit(features_train_ohe, target_train)
CPU times: user 705 ms, sys: 172 ms, total: 877 ms
Wall time: 293 ms
LinearRegression()
%%time
predictions = model.predict(features_valid_ohe)
mse(target_valid, predictions)**0.5
CPU times: user 61.2 ms, sys: 6.02 ms, total: 67.2 ms
Wall time: 15.9 ms
2181.4096215147615

Random forest

For tree-based models, we will use the ordinal-encoded samples.

%%time
grid_params = {'n_estimators': [50, 150, 300], 
               'max_depth': [5, 7, 9]}

model = GridSearchCV(estimator=RandomForestRegressor(random_state=57), 
                     param_grid=grid_params, 
                     scoring='neg_mean_squared_error')

model.fit(features_train_ord, target_train)
print(model.best_params_)
{'max_depth': 9, 'n_estimators': 300}
CPU times: user 5min 6s, sys: 195 ms, total: 5min 6s
Wall time: 5min 6s
%%time
model = RandomForestRegressor(n_estimators=300, max_depth=9, random_state=57)
model.fit(features_train_ord, target_train)
CPU times: user 19.4 s, sys: 2.87 ms, total: 19.4 s
Wall time: 19.4 s
RandomForestRegressor(max_depth=9, n_estimators=300, random_state=57)
%%time
predictions = model.predict(features_valid_ord)
mse(target_valid, predictions)**0.5
CPU times: user 498 ms, sys: 2.01 ms, total: 500 ms
Wall time: 497 ms
2006.486230433868

LightGBM

%%time
params_grid = {'learning_rate': [0.05, 0.1, 0.2], 
               'n_estimators': [50, 100, 150, 200], 
               'colsample_bytree': [0.5, 0.7, 1.0]}

model = GridSearchCV(estimator=lgb.LGBMRegressor(objective='regression', random_state=57), 
                     param_grid=params_grid, 
                     scoring='neg_mean_squared_error')

model.fit(features_train_ohe, target_train)
print(model.best_params_)
{'colsample_bytree': 1.0, 'learning_rate': 0.2, 'n_estimators': 200}
CPU times: user 16min 34s, sys: 56.5 s, total: 17min 30s
Wall time: 2min 21s
%%time
model = lgb.LGBMRegressor(objective='regression', 
                          colsample_bytree=1.0, 
                          learning_rate=0.2, 
                          n_estimators=200,
                          random_state=57)
model.fit(features_train_ohe, target_train)
CPU times: user 7.51 s, sys: 401 ms, total: 7.91 s
Wall time: 1.03 s
LGBMRegressor(learning_rate=0.2, n_estimators=200, objective='regression',
              random_state=57)
%%time
predictions = model.predict(features_valid_ohe)
mse(target_valid, predictions)**0.5
CPU times: user 1.1 s, sys: 58.8 ms, total: 1.16 s
Wall time: 158 ms
1782.2125817272529

Let's also try training on native categorical features, which LightGBM supports directly.

features_train
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 0.406812 150000 gasoline audi no
34735 sedan 2003 manual 0.605208 90000 gasoline mercedes_benz no
313161 other 2000 manual -0.287575 100000 petrol fiat unknown
263346 bus 2002 manual 0.545689 150000 petrol opel no
126860 sedan 1998 manual -0.366933 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
69954 sedan 1991 manual -0.287575 150000 other opel no
278609 coupe 1999 manual 0.605208 150000 petrol bmw no
3433 sedan 2004 manual -0.287575 125000 petrol seat no
43224 sedan 1994 manual 0.327453 125000 petrol mercedes_benz no
86464 small 2006 manual -1.100999 70000 petrol volkswagen no

47024 rows × 8 columns

features_train_cat = features_train.copy()
features_valid_cat = features_valid.copy()
features_test_cat = features_test.copy()

for column in features_train_cat.columns:
    if features_train_cat[column].dtype == 'object':
        features_train_cat[column] = features_train_cat[column].astype('category')
for column in features_valid_cat.columns:
    if features_valid_cat[column].dtype == 'object':
        features_valid_cat[column] = features_valid_cat[column].astype('category')
for column in features_test_cat.columns:
    if features_test_cat[column].dtype == 'object':
        features_test_cat[column] = features_test_cat[column].astype('category')
features_valid_cat.dtypes
vehicle_type    category
year            category
gearbox         category
power            float64
km              category
fuel_type       category
brand           category
repaired        category
dtype: object
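The three conversion loops above can be collapsed into one helper; a sketch (the toy frame is hypothetical):

```python
import pandas as pd

def objects_to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every object column cast to pandas 'category'."""
    out = df.copy()
    obj_cols = out.select_dtypes(include='object').columns
    out[obj_cols] = out[obj_cols].astype('category')
    return out

# Toy frame (hypothetical data)
df = pd.DataFrame({'brand': ['bmw', 'audi'], 'power': [116, 150]})
converted = objects_to_category(df)
print(converted.dtypes['brand'])  # category
```

Applying the helper to each of the three splits keeps the conversion logic in one place.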
%%time
params_grid = {'learning_rate': [0.05, 0.1, 0.2], 
               'n_estimators': [50, 100, 150, 200], 
               'colsample_bytree': [0.5, 0.7, 1.0]}

model = GridSearchCV(estimator=lgb.LGBMRegressor(objective='regression', random_state=57), 
                     param_grid=params_grid, 
                     scoring='neg_mean_squared_error')

model.fit(features_train_cat, target_train)
print(model.best_params_)
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'n_estimators': 200}
CPU times: user 17min 16s, sys: 57.9 s, total: 18min 14s
Wall time: 2min 30s
%%time
model = lgb.LGBMRegressor(objective='regression', 
                          colsample_bytree=1.0, 
                          learning_rate=0.1, 
                          n_estimators=200,
                          random_state=57)
model.fit(features_train_cat, target_train)
CPU times: user 5.58 s, sys: 210 ms, total: 5.79 s
Wall time: 744 ms
LGBMRegressor(n_estimators=200, objective='regression', random_state=57)
%%time
predictions = model.predict(features_valid_cat)
mse(target_valid, predictions)**0.5
CPU times: user 1.45 s, sys: 51.9 ms, total: 1.5 s
Wall time: 199 ms
1715.9009475795185

That worked out a little better.

CatBoost

Let's try CatBoost regression now.

features_train
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 0.406812 150000 gasoline audi no
34735 sedan 2003 manual 0.605208 90000 gasoline mercedes_benz no
313161 other 2000 manual -0.287575 100000 petrol fiat unknown
263346 bus 2002 manual 0.545689 150000 petrol opel no
126860 sedan 1998 manual -0.366933 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
69954 sedan 1991 manual -0.287575 150000 other opel no
278609 coupe 1999 manual 0.605208 150000 petrol bmw no
3433 sedan 2004 manual -0.287575 125000 petrol seat no
43224 sedan 1994 manual 0.327453 125000 petrol mercedes_benz no
86464 small 2006 manual -1.100999 70000 petrol volkswagen no

47024 rows × 8 columns

cat_features
['vehicle_type', 'year', 'gearbox', 'km', 'fuel_type', 'brand', 'repaired']
%%time
model = CatBoostRegressor(loss_function='RMSE', cat_features=cat_features, logging_level='Silent')
grid = {'iterations': [100, 150, 200, 250, 300],
        'learning_rate': [0.03, 0.05, 0.1],
        'depth': [2, 4, 6, 8]}
model.grid_search(grid, features_train, target_train, verbose=False)
print(model.get_params())
{'loss_function': 'RMSE', 'logging_level': 'Silent', 'cat_features': ['vehicle_type', 'year', 'gearbox', 'km', 'fuel_type', 'brand', 'repaired'], 'depth': 8, 'iterations': 300, 'learning_rate': 0.1}
CPU times: user 38min 45s, sys: 4min 54s, total: 43min 40s
Wall time: 6min 39s
%%time
model = CatBoostRegressor(loss_function='RMSE', 
                          depth=8, 
                          iterations=300, 
                          learning_rate=0.1, 
                          verbose=False, 
                          cat_features=cat_features)
model.fit(features_train, target_train, verbose=False)
CPU times: user 1min 55s, sys: 13.4 s, total: 2min 9s
Wall time: 21.2 s
<catboost.core.CatBoostRegressor at 0x7327db366df0>
%%time
predictions = model.predict(features_valid)
rmse = (np.sqrt(mse(target_valid, predictions)))
print('RMSE: {:.2f}'.format(rmse))
RMSE: 1730.46
CPU times: user 191 ms, sys: 30 ms, total: 221 ms
Wall time: 61.4 ms

XGBoost

%%time
params = {'max_depth': [5, 6, 7], 
          'learning_rate': [0.02, 0.05, 0.1], 
          'colsample_bytree': [0.3, 0.5, 0.7]}

grid_search = GridSearchCV(estimator=xgboost.XGBRegressor(), 
                           param_grid=params, 
                           scoring='neg_mean_squared_error')

grid_search.fit(features_train_ohe, target_train)
print(grid_search.best_params_)
{'colsample_bytree': 0.5, 'learning_rate': 0.1, 'max_depth': 7}
CPU times: user 2h 11min 59s, sys: 5min 13s, total: 2h 17min 13s
Wall time: 18min 40s
%%time
model = xgboost.XGBRegressor(colsample_bytree=0.5, 
                             learning_rate=0.1, 
                             max_depth=7, 
                             verbosity=0)
model.fit(features_train_ohe, target_train)
CPU times: user 1min 30s, sys: 3.98 s, total: 1min 34s
Wall time: 13.5 s
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.5, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)
%%time
predictions = grid_search.predict(features_valid_ohe)
mse(target_valid, predictions)**0.5
CPU times: user 533 ms, sys: 35.3 ms, total: 568 ms
Wall time: 84.7 ms
1887.3826840822685

Results

Thus, we obtain the following table.

Model              Hyperparameter selection   Training time   Prediction time   RMSE
Linear regression  -                          293 ms          16 ms             2181
Random forest      5 min 6 s                  19 s            497 ms            2006
LightGBM           2 min 30 s                 744 ms          199 ms            1716
CatBoost           6 min 39 s                 21 s            61 ms             1730
XGBoost            18 min 40 s                14 s            85 ms             1887

Sanity check

%%time
model = DummyRegressor()
model.fit(features_train_ohe, target_train)
predictions = model.predict(features_valid_ohe)
mse(target_valid, predictions)**0.5
CPU times: user 23 ms, sys: 3.94 ms, total: 26.9 ms
Wall time: 5.75 ms
4586.095252190357

All of our models pass this minimum.

Testing

We test LightGBM with parameters 'colsample_bytree': 1.0, 'learning_rate': 0.1, 'n_estimators': 200.

features_test_cat
vehicle_type year gearbox power km fuel_type brand repaired
49749 wagon 2006 manual 0.406812 150000 other volkswagen no
161214 small 1997 manual -0.882763 150000 petrol volkswagen no
244831 wagon 2008 manual 1.180557 150000 gasoline audi no
211493 convertible 2002 manual 4.434254 150000 petrol bmw no
52708 small 2003 manual -1.220037 125000 petrol renault no
... ... ... ... ... ... ... ... ...
287496 small 2010 auto -1.299395 90000 gasoline smart no
202431 wagon 2011 manual -0.069339 80000 gasoline ford no
236513 small 1996 manual -0.585169 150000 petrol opel no
319004 wagon 2003 auto 0.863123 150000 gasoline audi no
27170 sedan 2003 manual -0.327254 150000 petrol fiat no

15675 rows × 8 columns

We will train on the combined training and validation data.

features_for_test_training = pd.concat([features_train_cat, features_valid_cat])
target_for_test_training = pd.concat([target_train, target_valid])
features_for_test_training
vehicle_type year gearbox power km fuel_type brand repaired
159639 wagon 2005 auto 0.406812 150000 gasoline audi no
34735 sedan 2003 manual 0.605208 90000 gasoline mercedes_benz no
313161 other 2000 manual -0.287575 100000 petrol fiat unknown
263346 bus 2002 manual 0.545689 150000 petrol opel no
126860 sedan 1998 manual -0.366933 150000 other volkswagen no
... ... ... ... ... ... ... ... ...
16674 sedan 1992 manual 1.081359 150000 petrol audi no
110158 small 2002 manual -0.882763 80000 petrol opel no
227995 wagon 2012 manual -0.545490 60000 gasoline peugeot no
130757 wagon 2012 auto 0.406812 150000 gasoline volkswagen no
47062 wagon 2006 auto 1.002000 150000 gasoline volkswagen no

62699 rows × 8 columns

target_for_test_training.shape
(62699,)
features_valid_cat.dtypes
vehicle_type    category
year            category
gearbox         category
power            float64
km              category
fuel_type       category
brand           category
repaired        category
dtype: object
for column in features_for_test_training.columns:
    if features_for_test_training[column].dtype == 'object':
        features_for_test_training[column] = (features_for_test_training[column]
                                              .astype('category'))
features_for_test_training.dtypes
vehicle_type    category
year            category
gearbox         category
power            float64
km              category
fuel_type       category
brand           category
repaired        category
dtype: object
%%time
model = lgb.LGBMRegressor(objective='regression', 
                          colsample_bytree=1.0, 
                          learning_rate=0.1, 
                          n_estimators=200,
                          random_state=57)
model.fit(features_for_test_training, target_for_test_training)
CPU times: user 9.29 s, sys: 487 ms, total: 9.78 s
Wall time: 1.33 s
LGBMRegressor(n_estimators=200, objective='regression', random_state=57)
%%time
predictions = model.predict(features_test_cat)
mse(target_test, predictions)**0.5
CPU times: user 1.51 s, sys: 81 ms, total: 1.59 s
Wall time: 212 ms
1673.3438690669009

The result is better than before because of the added validation data.

Conclusions

  1. Data Features
    • The table has data on nearly 350 thousand offers during one month in 2016
    • All the missing values occurred in categorical columns and were filled with an 'unknown' category.
    • There were a few anomalies in the data. Most were removed, but implausibly high engine power values had their order of magnitude reduced instead.
    • Car model, postal code, month of car registration, and the dates of downloading the questionnaire and creating the listing were not used for training; the rest of the data was.
  2. Training
    • Before training, categorical features were encoded (one-hot or ordinal, depending on the model) and numerical features were scaled.
    • The models considered were linear regression, random forest, and three boosting models.
  3. Results
    • All models passed the sanity check.
    • The linear regression performed the worst. Its RMSE amounted to 2181.
    • The random forest with RMSE 2006 was slightly better.
    • The best performing models turned out to be the boosting models: RMSE was 1887, 1730, and 1716 for XGBoost, CatBoost, and LightGBM respectively.
    • During testing, LightGBM scored even better: RMSE 1673. This is likely because the model was trained on more data (training plus validation).